Abstract

The problem of finding a center string that is 'close' to every given string arises and 
has many applications in computational molecular biology and coding theory. 

This problem has two versions: the Closest String problem and the Closest Substring
problem. Assume that we are given a set $S = \{s_1, s_2, \ldots, s_n\}$ of strings, say,
each of length $m$. The Closest String problem asks for the smallest $d$
and a string $s$ of length $m$ which is within Hamming distance $d$ of each $s_i \in S$. This
problem comes from coding theory, where we look for a code not too far away
from a given set of codes. The problem is NP-hard. Berman et al. give
a polynomial time algorithm for constant $d$. For super-logarithmic $d$, Ben-Dor et al.
give an efficient approximation algorithm using the linear programming relaxation technique.
The best polynomial time approximation has ratio $\frac{4}{3}$ for all $d$, given by [5] and [11].

The Closest Substring problem looks for a string $t$ which is within Hamming distance
$d$ of a substring of each $s_i$. Previously, this problem had only a $2 - \frac{2}{2|\Sigma|+1}$ approximation
algorithm, and it is much more elusive than the Closest String problem, but
it has many applications in finding conserved regions, genetic drug target identification,
and genetic probes in molecular biology [20]. Whether
there are efficient approximation algorithms for both problems are major open questions
in this area.

We present two polynomial time approxmation algorithms with approximation ratio 
1 + e for any small e to settle both questions. 



*Some of the results in this paper have been presented in Proc. 31st ACM Symp. Theory of Computing,
May 1999 [12], and in Proc. 11th Symp. Combinatorial Pattern Matching, June 2000.
1 Introduction



Many problems in molecular biology involve finding similar regions common to each
sequence in a given set of DNA, RNA, or protein sequences. These problems find applications
in locating binding sites and finding conserved regions in unaligned sequences [20], [8],
genetic drug target identification, designing genetic probes [11], universal PCR primer
design [3], [17], [11], and, outside computational biology, in coding theory. Such
problems may be considered to be various generalizations of the common substring
problem, allowing errors. Many objective functions have been proposed for finding such regions
common to all of the given strings. A popular and most fundamental measure is the
Hamming distance. Other measures, such as the relative entropy measure used by Stormo and his
coauthors, may be considered generalizations of Hamming distance; they require different
techniques and are considered in [13].

Let $s$ and $s'$ be finite strings. Let $d(s, s')$ denote the Hamming distance between $s$ and
$s'$. $|s|$ is the length of $s$. $s[i]$ is the $i$-th character of $s$. Thus, $s = s[1]s[2] \cdots s[|s|]$. The
following are the problems we study in this paper:

Closest String: Given a set $S = \{s_1, s_2, \ldots, s_n\}$ of strings each of length $m$, find a center
string $s$ of length $m$ minimizing $d$ such that for every string $s_i \in S$, $d(s, s_i) \le d$.

Closest Substring: Given a set $S = \{s_1, s_2, \ldots, s_n\}$ of strings, and an integer $L$, find
a center string $s$ of length $L$ minimizing $d$ such that for each $s_i \in S$ there is a length $L$
substring $t_i$ of $s_i$ with $d(s, t_i) \le d$.
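To make the two objective functions concrete, here is a minimal Python sketch. The function names (`hamming`, `closest_string_cost`, `closest_substring_cost`, `brute_force_closest_string`) are illustrative and not from the paper, and the exhaustive search is only a reference for tiny inputs, not one of the algorithms studied here.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Hamming distance d(a, b) between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def closest_string_cost(center: str, strings: list[str]) -> int:
    """Radius of `center`: the largest distance to any input string."""
    return max(hamming(center, s) for s in strings)

def substrings(s: str, length: int) -> list[str]:
    """All length-`length` substrings (windows) of s."""
    return [s[i:i + length] for i in range(len(s) - length + 1)]

def closest_substring_cost(center: str, strings: list[str]) -> int:
    """Radius of `center` when every string contributes its best window."""
    L = len(center)
    return max(min(hamming(center, t) for t in substrings(s, L)) for s in strings)

def brute_force_closest_string(strings: list[str], alphabet: str) -> tuple[str, int]:
    """Exhaustive search over all |alphabet|^m centers (tiny inputs only)."""
    m = len(strings[0])
    best = min(("".join(c) for c in product(alphabet, repeat=m)),
               key=lambda c: closest_string_cost(c, strings))
    return best, closest_string_cost(best, strings)
```

The exhaustive search takes time exponential in $m$, which is why the approximation schemes below are of interest.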

Closest String has been widely and independently studied in different contexts. In
the context of coding theory it was shown to be NP-hard. In DNA sequence related topics,
an exact algorithm is known when the distance $d$ is a constant, and near-optimal
approximation algorithms are known only for large $d$ (super-logarithmic in the number of sequences);
however, the straightforward linear programming relaxation technique does not work when $d$ is
small because the randomized rounding procedure introduces large errors. This is exactly
the reason why [5], [11] analyzed more involved approximation algorithms, and obtained the
ratio $\frac{4}{3}$ approximation algorithms. Note that small $d$ is key in applications such as
genetic drug target search, where we look for similar regions to which a complementary drug
sequence would bind. It is a major open problem [2], [11] to achieve the best
approximation ratio for this problem. (Justifications for using Hamming distance can also be
found in these references, especially [11].) We present a polynomial time approximation scheme
(PTAS), settling the problem.

Closest Substring is a more general version of the Closest String problem.
Obviously, it is also NP-hard. In applications such as drug target identification and genetic
probe design, the radius $d$ is usually small. Moreover, when the radius $d$ is small, the center
strings can also be used as motifs in repeated-motif methods for multiple sequence alignment
problems [18], [21], [23], which repeatedly find motifs and recursively decompose the
sequences into shorter sequences. A trivial ratio-2 approximation was given in [11]. We
presented the first nontrivial algorithm, with approximation ratio $2 - \frac{2}{2|\Sigma|+1}$, in earlier work. This is
a key open problem in the search for a potential genetic drug sequence which is "close" to some
sequences (of harmful germs) and "far" from some other sequences (of humans). The
problem appears to be much more elusive than Closest String. We extend the techniques
developed for Closest String here to design a PTAS for the Closest Substring problem when $d$
is small, i.e., $d \le O(\log N)$, where $N$ is the input size of the instance. Using a random
sampling technique, and combining our methods for Closest String, we then design a
PTAS for Closest Substring, for all $d$.

2 Approximating Closest String 

In this section, we give a PTAS for Closest String. We note that a direct application of
LP relaxation does not work when the optimal solution is small. Rather, we extend
an idea in [11] to do LP relaxation only on a fraction of the bits. Let $S = \{s_1, s_2, \ldots, s_n\}$
be a set of $n$ strings each of length $m$.

The idea is as follows. Let $r$ be a constant. If we choose a subset of $r$ strings from $S$,
consider the bits on which they all agree. Intuitively, we can replace the corresponding bits in
the optimal solution by these bits of the $r$ strings, and this will only slightly worsen the
solution. Lemma 1 shows that this is true for at least one subset of $r$ strings. Then all we
need to do is to optimize on the positions (bits) where they do not agree, by LP relaxation
and randomized rounding.

We first introduce some notation. Let $P = \{j_1, j_2, \ldots, j_k\}$ be a set (multiset) with
$1 \le j_1 \le j_2 \le \cdots \le j_k \le m$. $P$ is called a position set (multiset). Let $s$ be a string of length
$m$; then $s|_P$ is the string $s[j_1]s[j_2] \cdots s[j_k]$.

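For any $k \ge 2$, let $1 \le i_1, i_2, \ldots, i_k \le n$ be distinct numbers. Let $Q_{i_1,i_2,\ldots,i_k}$ be the
set of positions where $s_{i_1}, s_{i_2}, \ldots, s_{i_k}$ agree. Let $s$ be an optimal center string and $d_{opt}$ its cost.
Obviously $|Q_{i_1,i_2,\ldots,i_k}| \ge m - k\, d_{opt}$. Let $\rho_0 = \max_{1 \le i,j \le n} d(s_i, s_j)/d_{opt}$.
The following lemma is the key of our approximation algorithm.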
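The restriction $s|_P$ and the agreement set $Q$ can be sketched as follows. This is an illustration only: the helper names are hypothetical, and positions are 0-based here, whereas the text uses 1-based positions.

```python
def restrict(s: str, positions: list[int]) -> str:
    """s|_P: the letters of s at the given 0-based positions (repeats allowed)."""
    return "".join(s[j] for j in positions)

def agreement_positions(strings: list[str]) -> list[int]:
    """Q: the 0-based positions at which all given strings carry the same letter."""
    m = len(strings[0])
    return [j for j in range(m) if all(s[j] == strings[0][j] for s in strings)]

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

# Small worked example: three strings within distance 1 of the center "ACGTAC".
chosen = ["ACGTAC", "ACGTTC", "AGGTAC"]
center = "ACGTAC"
d = max(hamming(center, s) for s in chosen)   # radius of this center
Q = agreement_positions(chosen)
# Each chosen string disagrees with the center in at most d positions,
# so at most k*d positions can be lost from Q.
bound_holds = len(Q) >= len(center) - len(chosen) * d
```

The final check mirrors the bound $|Q_{i_1,\ldots,i_k}| \ge m - k\,d_{opt}$: every position outside $Q$ is a position where at least one of the $k$ strings differs from the center.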

Lemma 1 If $\rho_0 > 1 + \frac{1}{2r-1}$, then for any constant $r$, there are indices $1 \le i_1, i_2, \ldots, i_r \le n$
such that for any $1 \le l \le n$,

$$d(s_l|_{Q_{i_1,i_2,\ldots,i_r}},\, s_{i_1}|_{Q_{i_1,i_2,\ldots,i_r}}) - d(s_l|_{Q_{i_1,i_2,\ldots,i_r}},\, s|_{Q_{i_1,i_2,\ldots,i_r}}) \le \frac{1}{2r-1}\, d_{opt}.$$

Proof. Let $p_{i_1,i_2,\ldots,i_k}$ be the number of mismatches between $s_{i_1}$ and $s$ at the positions in
$Q_{i_1,i_2,\ldots,i_k}$. Let $\rho_k = \min_{1 \le i_1,i_2,\ldots,i_k \le n} p_{i_1,i_2,\ldots,i_k}/d_{opt}$. First, we prove the following claim.
Claim 2 For any $k$ such that $2 \le k \le r$, where $r$ is the constant in the algorithm
closestString, there are indices $1 \le i_1, i_2, \ldots, i_r \le n$ such that for any $1 \le l \le n$,

$$|\{j \in Q_{i_1,i_2,\ldots,i_r} \mid s_{i_1}[j] \ne s_l[j] \text{ and } s_{i_1}[j] \ne s[j]\}| \le (\rho_k - \rho_{k+1})\, d_{opt}.$$

Proof. Consider indices $1 \le i_1, i_2, \ldots, i_k \le n$ such that $p_{i_1,i_2,\ldots,i_k} = \rho_k\, d_{opt}$. Then for
any $1 \le i_{k+1}, i_{k+2}, \ldots, i_r \le n$ and $1 \le l \le n$, we have

$$\begin{aligned}
&|\{j \in Q_{i_1,i_2,\ldots,i_r} \mid s_{i_1}[j] \ne s_l[j] \text{ and } s_{i_1}[j] \ne s[j]\}| \\
&\quad\le |\{j \in Q_{i_1,i_2,\ldots,i_k} \mid s_{i_1}[j] \ne s_l[j] \text{ and } s_{i_1}[j] \ne s[j]\}| && (1) \\
&\quad= |\{j \in Q_{i_1,i_2,\ldots,i_k} \mid s_{i_1}[j] \ne s[j]\} - \{j \in Q_{i_1,i_2,\ldots,i_k} \mid s_{i_1}[j] = s_l[j] \text{ and } s_{i_1}[j] \ne s[j]\}| \\
&\quad= |\{j \in Q_{i_1,i_2,\ldots,i_k} \mid s_{i_1}[j] \ne s[j]\} - \{j \in Q_{i_1,i_2,\ldots,i_k,l} \mid s_{i_1}[j] \ne s[j]\}| \\
&\quad= p_{i_1,i_2,\ldots,i_k} - p_{i_1,i_2,\ldots,i_k,l} && (2) \\
&\quad\le (\rho_k - \rho_{k+1})\, d_{opt},
\end{aligned}$$

where Inequality (1) is from the fact that $Q_{i_1,i_2,\ldots,i_r} \subseteq Q_{i_1,i_2,\ldots,i_k}$ and Equality (2) is from
the fact that $Q_{i_1,i_2,\ldots,i_k,l} \subseteq Q_{i_1,i_2,\ldots,i_k}$. □

Claim 3 $\min\{\rho_0 - 1,\ \rho_2 - \rho_3,\ \rho_3 - \rho_4,\ \ldots,\ \rho_r - \rho_{r+1}\} \le \frac{1}{2r-1}$.

Proof. Consider $1 \le i, j \le n$ such that $d(s_i, s_j) = \rho_0\, d_{opt}$. Then among the positions
where $s_i$ mismatches $s_j$, for at least one of the two strings, say, $s_i$, the number of mismatches
between $s_i$ and $s$ is at least $\rho_0\, d_{opt}/2$. Thus, among the positions where $s_i$ matches $s_j$, the
number of mismatches between $s_i$ and $s$ is at most $(1 - \frac{\rho_0}{2})\, d_{opt}$. Therefore, $\rho_2 \le 1 - \frac{\rho_0}{2}$. So,

$$\frac{\frac{1}{2}(\rho_0 - 1) + (\rho_2 - \rho_3) + (\rho_3 - \rho_4) + \cdots + (\rho_r - \rho_{r+1})}{\frac{1}{2} + r - 1} \le \frac{\frac{1}{2}\rho_0 + \rho_2 - \frac{1}{2}}{r - \frac{1}{2}} \le \frac{\frac{1}{2}}{r - \frac{1}{2}} = \frac{1}{2r-1}.$$

Thus, at least one of $\rho_0 - 1, \rho_2 - \rho_3, \rho_3 - \rho_4, \ldots, \rho_r - \rho_{r+1}$ is less than or equal to $\frac{1}{2r-1}$. □

If $\rho_0 > 1 + \frac{1}{2r-1}$, then from Claim 3, there must be a $2 \le k \le r$ such that $\rho_k - \rho_{k+1} \le \frac{1}{2r-1}$.
From Claim 2,

$$|\{j \in Q_{i_1,i_2,\ldots,i_r} \mid s_{i_1}[j] \ne s_l[j] \text{ and } s_{i_1}[j] \ne s[j]\}| \le \frac{1}{2r-1}\, d_{opt}.$$

Hence, there are at most $\frac{1}{2r-1}\, d_{opt}$ bits in $Q_{i_1,i_2,\ldots,i_r}$ where $s_l$ differs from $s_{i_1}$ while agreeing
with $s$. The lemma is proved. □

Lemma 1 suggests selecting $r$ strings $s_{i_1}, s_{i_2}, \ldots, s_{i_r}$ from $S$ at a time and using the unique
letters at the positions in $Q_{i_1,i_2,\ldots,i_r}$ as an approximation of the optimal center string $s$. For
the positions in $P_{i_1,i_2,\ldots,i_r} = \{1, 2, \ldots, m\} - Q_{i_1,i_2,\ldots,i_r}$, we use ideas in [11], i.e., the following
two strategies: (1) if $|P_{i_1,i_2,\ldots,i_r}|$ is small, i.e., $d \le O(\log m)$, we can enumerate the $|\Sigma|^{|P_{i_1,i_2,\ldots,i_r}|}$
possibilities to approximate $s$; (2) if $|P_{i_1,i_2,\ldots,i_r}|$ is large, i.e., $d > O(\log m)$, we use the LP
relaxation to approximate $s$. The details are found in Lemma 6. Before presenting our
main result, we need the following two lemmas, where Lemma 4 is commonly known as
Chernoff's bounds ([15], Theorems 4.2 and 4.3):



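Lemma 4 [15] Let $X_1, X_2, \ldots, X_n$ be $n$ independent random 0-1 variables, where $X_i$ takes
1 with probability $p_i$, $0 < p_i < 1$. Let $X = \sum_{i=1}^{n} X_i$, and $\mu = E[X]$. Then for any $\delta > 0$,

(1) $\Pr(X > (1 + \delta)\mu) < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{\mu}$,

(2) $\Pr(X < (1 - \delta)\mu) \le \exp\left(-\frac{1}{2}\mu\delta^2\right)$.

From Lemma 4, we can prove the following lemma:

Lemma 5 Let $X_i$, $X$ and $\mu$ be defined as in Lemma 4. Then for any $0 < \epsilon \le 1$,

(1) $\Pr(X > \mu + \epsilon n) < \exp\left(-\frac{1}{3} n \epsilon^2\right)$,

(2) $\Pr(X < \mu - \epsilon n) \le \exp\left(-\frac{1}{2} n \epsilon^2\right)$.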
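Proof. (1) Let $\delta = \frac{\epsilon n}{\mu}$. By Lemma 4,

$$\Pr(X > \mu + \epsilon n) < \left[\frac{e^{\epsilon n/\mu}}{\left(1 + \frac{\epsilon n}{\mu}\right)^{\left(1 + \frac{\epsilon n}{\mu}\right)}}\right]^{\mu} = \left[\frac{e}{\left(1 + \frac{\epsilon n}{\mu}\right)^{\left(1 + \frac{\mu}{\epsilon n}\right)}}\right]^{\epsilon n} \le \left[\frac{e}{(1 + \epsilon)^{\left(1 + \frac{1}{\epsilon}\right)}}\right]^{\epsilon n},$$

where the last inequality is because $\mu \le n$ and $(1 + x)^{\left(1 + \frac{1}{x}\right)}$ is increasing for $x \ge 0$. It
is easy to verify that for $0 < \epsilon \le 1$,

$$\frac{e}{(1 + \epsilon)^{\left(1 + \frac{1}{\epsilon}\right)}} \le \exp\left(-\frac{\epsilon}{3}\right).$$

Therefore, (1) is proved.

(2) Let $\delta = \frac{\epsilon n}{\mu}$. By Lemma 4, (2) is proved. □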
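As a sanity check of Lemma 5, the two bounds can be compared against exact tail probabilities in the special case where all $p_i$ equal a common value $p$, so that $X$ is binomial. This is a numerical illustration under that assumption, not part of the proof; the function names are hypothetical.

```python
from math import comb, exp

def binomial_upper_tail(n: int, p: float, threshold: float) -> float:
    """Exact Pr(X > threshold) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k > threshold)

def binomial_lower_tail(n: int, p: float, threshold: float) -> float:
    """Exact Pr(X < threshold) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k < threshold)

def lemma5_holds(n: int, p: float, eps: float) -> bool:
    """Check both bounds of Lemma 5 for identical success probabilities p."""
    mu = n * p
    upper_ok = binomial_upper_tail(n, p, mu + eps * n) < exp(-n * eps**2 / 3)
    lower_ok = binomial_lower_tail(n, p, mu - eps * n) <= exp(-n * eps**2 / 2)
    return upper_ok and lower_ok
```

Note that the deviation in Lemma 5 is measured in units of $n$ rather than $\mu$, which is what makes it convenient when $\mu$ is only known to be bounded by a quantity such as $d_0$.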
Now, we come back to the approximation of $s$ at the positions in $P_{i_1,i_2,\ldots,i_r}$.

Lemma 6 Let $S = \{s_1, s_2, \ldots, s_n\}$, where $|s_i| = m$ for all $i$. Assume that $s$ is the optimal
solution of Closest String and $\max_{1 \le i \le n} d(s_i, s) = d_{opt}$. Given a string $s'$ and a position
set $Q$ of size $m - O(d_{opt})$ such that for any $i = 1, \ldots, n$,

$$d(s_i|_Q, s'|_Q) - d(s_i|_Q, s|_Q) \le \rho\, d_{opt}, \qquad (3)$$

where $0 \le \rho \le 1$, one can obtain a solution with cost at most $(1 + \rho + \epsilon)\, d_{opt}$ in polynomial
time for any fixed $\epsilon > 0$.

Proof. Let $P = \{1, 2, \ldots, m\} - Q$. Then, for any two strings $x$ and $x'$ of length $m$,
we have $d(x|_P, x'|_P) + d(x|_Q, x'|_Q) = d(x, x')$. Thus for any $i = 1, 2, \ldots, n$,

$$d(s_i|_P, s|_P) = d(s_i, s) - d(s_i|_Q, s|_Q) \le (1 + \rho)\, d_{opt} - d(s_i|_Q, s'|_Q).$$

Therefore, the following optimization problem

$$\begin{cases} \min d; \\ d(s_i|_P, x) \le d - d(s_i|_Q, s'|_Q), \quad i = 1, \ldots, n; \quad |x| = |P|, \end{cases} \qquad (4)$$

has a solution with cost $d \le (1 + \rho)\, d_{opt}$. Suppose that the optimization problem has an
optimal solution $x$ such that $d = d_0$. Then

$$d_0 \le (1 + \rho)\, d_{opt}. \qquad (5)$$

Now we solve (4) approximately. Similar to [2], [11], we use a 0-1 variable $x_{j,a}$ to indicate
whether $x[j] = a$. Denote $\chi(s_i[j], a) = 0$ if $s_i[j] = a$ and $1$ if $s_i[j] \ne a$. Then (4) can be
rewritten as a 0-1 optimization problem as follows:

$$\begin{cases} \min d; \\ \sum_{a \in \Sigma} x_{j,a} = 1, \quad j = 1, 2, \ldots, |P|, \\ \sum_{1 \le j \le |P|} \sum_{a \in \Sigma} \chi(s_i[j], a)\, x_{j,a} \le d - d(s_i|_Q, s'|_Q), \quad i = 1, 2, \ldots, n. \end{cases} \qquad (6)$$

Solve (6) by linear programming to get a fractional solution $\bar{x}_{j,a}$ with cost $\bar{d}$. Clearly $\bar{d} \le d_0$.
Independently for each $1 \le j \le |P|$, with probability $\bar{x}_{j,a}$ set $x_{j,a} = 1$ and $x_{j,a'} = 0$ for any
$a' \ne a$. Then we get a solution $x_{j,a}$ for the 0-1 optimization problem, hence a solution $x$ for
(4). It is easy to see that $\sum_{a \in \Sigma} \chi(s_i[j], a)\, x_{j,a}$ takes 1 or 0 randomly and independently for
different $j$'s. Thus $d(s_i|_P, x) = \sum_{1 \le j \le |P|} \sum_{a \in \Sigma} \chi(s_i[j], a)\, x_{j,a}$ is a sum of $|P|$ independent
0-1 random variables, and

$$\begin{aligned}
E[d(s_i|_P, x)] &= \sum_{1 \le j \le |P|} \sum_{a \in \Sigma} \chi(s_i[j], a)\, E[x_{j,a}] \\
&= \sum_{1 \le j \le |P|} \sum_{a \in \Sigma} \chi(s_i[j], a)\, \bar{x}_{j,a} \\
&\le \bar{d} - d(s_i|_Q, s'|_Q) \le d_0 - d(s_i|_Q, s'|_Q). \qquad (7)
\end{aligned}$$
Therefore, for any fixed $\epsilon' > 0$, by Lemma 5,

$$\Pr\left(d(s_i|_P, x) \ge \bar{d} + \epsilon'|P| - d(s_i|_Q, s'|_Q)\right) \le \exp\left(-\frac{1}{3}\epsilon'^2 |P|\right).$$

Considering all sequences, we have

$$\Pr\left(d(s_i|_P, x) \ge d_0 + \epsilon'|P| - d(s_i|_Q, s'|_Q) \text{ for at least one } i\right) \le n \times \exp\left(-\frac{1}{3}\epsilon'^2 |P|\right).$$

If $|P| \ge (4 \ln n)/\epsilon'^2$, then $n \times \exp\left(-\frac{1}{3}\epsilon'^2 |P|\right) \le n^{-\frac{1}{3}}$. Thus we obtain a randomized
algorithm to find a solution for (4) with cost at most $d_0 + \epsilon'|P|$ with probability at least
$1 - n^{-\frac{1}{3}}$. The above randomized algorithm can be derandomized by the standard method of
conditional probabilities [15].
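The rounding step alone can be sketched as follows. The sketch assumes a fractional solution is already available as one probability table per position of $P$ (solving the LP itself is not shown), and all names are illustrative rather than taken from the paper.

```python
import random

def round_fractional_solution(frac: list[dict[str, float]], rng: random.Random) -> str:
    """Randomized rounding: frac[j][a] is the fractional value of x_{j,a},
    and the values for each position j are assumed to sum to 1.
    Position j independently receives letter a with probability frac[j][a]."""
    letters = []
    for dist in frac:
        symbols = list(dist)
        weights = [dist[a] for a in symbols]
        letters.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(letters)

def expected_distance(target: str, frac: list[dict[str, float]]) -> float:
    """E[d(target, x)]: position j mismatches with probability
    1 - frac[j][target[j]], which equals sum over a != target[j] of frac[j][a]."""
    return sum(1.0 - dist.get(target[j], 0.0) for j, dist in enumerate(frac))

# Hypothetical fractional solution over three positions.
frac = [{"A": 1.0}, {"A": 0.5, "C": 0.5}, {"G": 0.25, "T": 0.75}]
rounded = round_fractional_solution(frac, random.Random(0))
```

Because positions are rounded independently, the distance from any fixed string to the rounded string is a sum of independent 0-1 variables, which is exactly the setting of Lemma 5.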
If $|P| < (4 \ln n)/\epsilon'^2$, then $|\Sigma|^{|P|} < n^{(4 \ln |\Sigma|)/\epsilon'^2}$ is a polynomial of $n$. So, we can enumerate
all strings in $\Sigma^{|P|}$ to find an optimal solution for (4). Thus, in both cases, we can obtain a
solution $x$ for the optimization problem (4) with cost at most $d_0 + \epsilon'|P|$ in polynomial time.
Since $|P| = O(d_{opt})$, $|P| \le c \times d_{opt}$ for a constant $c$. Let $\epsilon' = \frac{\epsilon}{c}$ and $s^* = R(s', x, P)$, the string
obtained from $s'$ by replacing its letters at the positions in $P$ with $x$. From Formula (4),

$$\begin{aligned}
d(s_i, s^*) &= d(s_i|_P, s^*|_P) + d(s_i|_Q, s^*|_Q) \\
&= d(s_i|_P, x) + d(s_i|_Q, s'|_Q) \\
&\le d_0 + \epsilon'|P| \le (1 + \rho)\, d_{opt} + \epsilon\, d_{opt},
\end{aligned}$$

where the last inequality is from Formula (5). This proves the lemma. □
Now we describe the complete algorithm in Figure 1.



Algorithm closestString

Input $s_1, s_2, \ldots, s_n \in \Sigma^m$.
Output a center string $s \in \Sigma^m$.

1. for each $r$-element subset $\{s_{i_1}, s_{i_2}, \ldots, s_{i_r}\}$ of the $n$ input strings do

(a) $Q = \{1 \le j \le m \mid s_{i_1}[j] = s_{i_2}[j] = \cdots = s_{i_r}[j]\}$, $P = \{1, 2, \ldots, m\} - Q$.

(b) Solve the optimization problem defined by Formula (4) as described
in the proof of Lemma 6 to get an approximate solution $x$ of length
$|P|$.

(c) Let $s'$ be a string such that $s'|_Q = s_{i_1}|_Q$ and $s'|_P = x$. Calculate the
cost of $s'$ as the center string.

2. for $i = 1, 2, \ldots, n$ do

calculate the cost of $s_i$ as the center string.

3. Output the best solution of the above two steps.

Figure 1: Algorithm for CLOSEST STRING
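The control flow of Figure 1 can be sketched in Python as below. One substitution is made and should be kept in mind: step 1(b) is replaced by exhaustive enumeration over $\Sigma^{|P|}$ instead of LP relaxation with randomized rounding, so this sketch runs in exponential time in $|P|$ and is suitable only for small examples. Function and parameter names are illustrative.

```python
from itertools import combinations, product

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def radius(center: str, strings: list[str]) -> int:
    """Cost of `center`: the largest distance to any input string."""
    return max(hamming(center, s) for s in strings)

def closest_string_sketch(strings: list[str], alphabet: str, r: int = 2) -> tuple[str, int]:
    m = len(strings[0])
    # Step 2: each input string is itself a candidate center.
    best = min(strings, key=lambda s: radius(s, strings))
    # Step 1: every r-element subset of the input strings.
    for subset in combinations(strings, r):
        # Step 1(a): Q = positions where the subset agrees, P = the rest.
        agree = {j for j in range(m) if all(s[j] == subset[0][j] for s in subset)}
        P = [j for j in range(m) if j not in agree]
        candidate = list(subset[0])  # keeps the agreed letters on Q
        # Step 1(b), replaced here by brute force over all fillings of P
        # (the paper uses LP relaxation and randomized rounding when P is large).
        for x in product(alphabet, repeat=len(P)):
            for j, a in zip(P, x):
                candidate[j] = a
            center = "".join(candidate)
            # Step 1(c): keep the candidate if it improves the cost.
            if radius(center, strings) < radius(best, strings):
                best = center
    # Step 3: report the best center found.
    return best, radius(best, strings)
```

For instance, on the two strings `"AA"` and `"TT"` the input strings themselves have cost 2, while the enumeration over $P$ finds a center of cost 1.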

Theorem 7 The algorithm closestString is a PTAS for Closest String.

Proof. Given an instance of Closest String, suppose $s$ is an optimal solution and
the optimal cost is $d_{opt}$, i.e., $d(s, s_i) \le d_{opt}$ for all $i$. Let $P$ be defined as in step 1(a) of
Algorithm closestString. Since for every position in $P$, at least one of the $r$ strings $s_{i_1}, s_{i_2}, \ldots, s_{i_r}$
conflicts with the optimal center string $s$, we have $|P| \le r \times d_{opt}$. As long as $r$ is a constant,
step 1(b) can be done in polynomial time by Lemma 6. Obviously the other steps of
Algorithm closestString run in polynomial time, with $r$ as a constant.

If $\rho_0 - 1 \le \frac{1}{2r-1}$, then by the definition of $\rho_0$, it is easy to see that the algorithm finds
a solution with cost at most $\rho_0\, d_{opt} \le (1 + \frac{1}{2r-1})\, d_{opt}$ in step 2.

If $\rho_0 > 1 + \frac{1}{2r-1}$, then from Lemma 1 and Lemma 6, the algorithm finds a solution with
cost at most $(1 + \frac{1}{2r-1} + \epsilon)\, d_{opt}$. This proves the theorem. □

3 Approximating Closest Substring when d is small 

In some applications, such as drug target identification and genetic probe design, the radius $d$
is often small. As a direct application of Lemma 1, we now present a PTAS for Closest
Substring when the radius $d$ is small, i.e., $d \le O(\log N)$, where $N$ stands for the input size
of the instance. Again, we focus on the construction of the center string. The basic idea
is to choose $r$ substrings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ of length $L$ from the strings in $S$, keep the letters
at the positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ all agree, and try all possibilities for the rest of the
positions. The complete algorithm is described in Figure 2.



Algorithm smallSubstring

Input $s_1, s_2, \ldots, s_n \in \Sigma^m$.
Output a center string $s \in \Sigma^L$.

1. for each $r$-element subset $\{t_{i_1}, t_{i_2}, \ldots, t_{i_r}\}$, where $t_{i_j}$ is a substring of
length $L$ from $s_{i_j}$, do

(a) $Q = \{1 \le j \le L \mid t_{i_1}[j] = t_{i_2}[j] = \cdots = t_{i_r}[j]\}$, $P = \{1, 2, \ldots, L\} - Q$.

(b) for every $x \in \Sigma^{|P|}$ do

let $t = R(t_{i_1}, x, P)$; compute the cost of the solution $t$.

2. for every length $L$ substring $t_k$ from any given sequence do

compute the cost of the solution with $t_k$ as the center string.

3. select a center string that leads to the best result in Step 1 and Step 2;
output the best solution of the above two steps.

Figure 2: Algorithm for Closest Substring when d is small
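Figure 2 translates fairly directly into code. The following is a minimal sketch under one stated assumption: the $r$ windows in step 1 are drawn from $r$ distinct input strings. All names are illustrative, and the running time is exponential in $|P|$, as in the figure.

```python
from itertools import combinations, product

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def windows(s: str, L: int) -> list[str]:
    """All length-L substrings of s."""
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def cost(center: str, strings: list[str]) -> int:
    """Closest Substring cost: each string contributes its best window."""
    L = len(center)
    return max(min(hamming(center, t) for t in windows(s, L)) for s in strings)

def small_substring_sketch(strings: list[str], L: int, alphabet: str,
                           r: int = 2) -> tuple[str, int]:
    # Step 2: every window of every input string is a candidate center.
    best = min((t for s in strings for t in windows(s, L)),
               key=lambda t: cost(t, strings))
    # Step 1: choose r input strings, then one window from each of them.
    for idx in combinations(range(len(strings)), r):
        for group in product(*(windows(strings[i], L) for i in idx)):
            # Step 1(a): positions where the chosen windows agree.
            agree = {j for j in range(L) if all(t[j] == group[0][j] for t in group)}
            P = [j for j in range(L) if j not in agree]
            candidate = list(group[0])
            # Step 1(b): try every filling x of the positions in P.
            for x in product(alphabet, repeat=len(P)):
                for j, a in zip(P, x):
                    candidate[j] = a
                center = "".join(candidate)
                if cost(center, strings) < cost(best, strings):
                    best = center
    # Step 3: report the best center found.
    return best, cost(best, strings)
```

When $d \le O(\log N)$, the set $P$ has size $O(r \log N)$, which keeps the inner enumeration polynomial in the input size.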

Theorem 8 Algorithm smallSubstring is a PTAS for Closest Substring when the radius
$d$ is small, i.e., $d \le O(\log N)$, where $N$ is the input size.

Proof. Obviously, the size of $P$ in Step 1 is at most $O(r \times \log N)$. Step 1 takes
$O((mn)^r \times |\Sigma|^{O(r \times \log N)} \times mnL) = O(N^{r+1} \times N^{O(r \times \log |\Sigma|)}) = O(N^{O(r \times \log |\Sigma|)})$ time. Other
steps take less than that time. Thus, the total time required is $O(N^{O(r \times \log |\Sigma|)})$, which is
polynomial in terms of the input size for any constant $r$.

From Lemma 1, the performance ratio of the algorithm is $1 + \frac{1}{2r-1}$. □

4 A PTAS For Closest Substring 

In this section, we further extend the algorithms for Closest String to a PTAS for
Closest Substring, making use of a random sampling strategy. Note that Algorithm
smallSubstring runs in exponential time for general radius $d$, and Algorithm closestString does not
work for Closest Substring since we do not know how to construct an optimization problem
similar to (4): the construction of (4) requires us to know all the $n$ strings (substrings)
in an optimal solution of Closest String (Closest Substring). It is easy to see that
the choice of a "good" substring from every string $s_i$ is the only obstacle on the way to the
solution. We use random sampling to handle this.

Now let us outline the main ideas. Let $(S = \{s_1, s_2, \ldots, s_n\}, L)$ be an instance of
Closest Substring, where $s_i$ is of length $m$. Suppose that $s$ is its optimal center string
and $t_i$ is a length $L$ substring of $s_i$ which is the closest to $s$ $(i = 1, 2, \ldots, n)$. Let $d_{opt} =
\max_{i=1}^{n} d(s, t_i)$. By trying all possibilities, we can assume that $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ are the $r$
substrings $t_{i_j}$ that satisfy Lemma 1 by replacing $s_l$ by $t_l$ and $s_{i_j}$ by $t_{i_j}$. Let $Q$ be the set
of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree and $P = \{1, 2, \ldots, L\} - Q$. By Lemma 1, $t_{i_1}|_Q$ is a
good approximation to $s|_Q$. We want to approximate $s|_P$ by the solution $x$ of the following
optimization problem (8), where $t'_i$ is a substring of $s_i$ and is up to us to choose.

$$\begin{cases} \min d; \\ d(t'_i|_P, x) \le d - d(t'_i|_Q, t_{i_1}|_Q), \quad i = 1, \ldots, n; \quad |x| = |P|. \end{cases} \qquad (8)$$

The ideal choice is $t'_i = t_i$, i.e., $t'_i$ is the closest to $s$ among all substrings of $s_i$. However,
we only approximately know $s$ in $Q$ and know nothing about $s$ in $P$ so far. So, we randomly
pick $O(\log(mn))$ positions from $P$. Suppose the multiset of these random positions is $R$.
By trying all possibilities, we can assume that we know $s$ at these $|R|$ positions. We then
find the substring $t'_i$ from $s_i$ such that $d(s|_R, t'_i|_R) \times \frac{|P|}{|R|} + d(t_{i_1}|_Q, t'_i|_Q)$ is minimized. Then
$t'_i$ potentially belongs to the substrings which are the closest to $s$.
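The sampling-based choice of $t'_i$ can be sketched as follows. The names `sample_positions` and `pick_window`, the parameter `anchor` (standing for $t_{i_1}$), and the 0-based positions are assumptions made for illustration; `y` plays the role of the guessed letters $s|_R$.

```python
import random

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def restrict(s: str, positions: list[int]) -> str:
    """s|_T for a 0-based position multiset T."""
    return "".join(s[j] for j in positions)

def sample_positions(P: list[int], size: int, rng: random.Random) -> list[int]:
    """Multiset R: `size` positions drawn uniformly from P with replacement."""
    return [rng.choice(P) for _ in range(size)]

def pick_window(s_i: str, L: int, y: str, R: list[int], P: list[int],
                Q: list[int], anchor: str) -> str:
    """Return the length-L window t of s_i minimizing
    d(y, t|_R) * |P|/|R| + d(anchor|_Q, t|_Q)."""
    scale = len(P) / len(R)
    def score(t: str) -> float:
        sampled = hamming(y, restrict(t, R)) * scale      # estimate of the distance on P
        agreed = hamming(restrict(anchor, Q), restrict(t, Q))  # exact distance on Q
        return sampled + agreed
    return min((s_i[k:k + L] for k in range(len(s_i) - L + 1)), key=score)
```

The first term scales the mismatches observed on the sample $R$ up to an estimate of the mismatches on all of $P$; the second term is exact because the letters on $Q$ are taken from $t_{i_1}$.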

Then we solve (8) approximately by the method provided in the proof of Lemma 6 and
combine the solution $x$ at $P$ with $t_{i_1}$ at $Q$; the resulting string should be a good approximation
to $s$. The detailed algorithm (Algorithm closestSubstring) is given in Figure 3. We prove
Theorem 9 in the rest of the section.

Algorithm closestSubstring

Input $n$ sequences $\{s_1, s_2, \ldots, s_n\} \subseteq \Sigma^m$, integer $L$.
Output the center string $s$.

1. for every $r$ length-$L$ substrings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ (allowing repeats, but if $t_{i_j}$
and $t_{i_k}$ are both chosen from the same $s_i$ then $t_{i_j} = t_{i_k}$) of $s_1, \ldots, s_n$ do

(a) $Q = \{1 \le j \le L \mid t_{i_1}[j] = t_{i_2}[j] = \cdots = t_{i_r}[j]\}$, $P = \{1, 2, \ldots, L\} - Q$.

(b) Let $R$ be a multiset containing $\lceil \frac{4}{\epsilon^2} \log(nm) \rceil$ uniformly random
positions from $P$.

(c) for every string $y$ of length $|R|$ do

(i) for $i$ from 1 to $n$ do

Let $t'_i$ be a length $L$ substring of $s_i$ minimizing $d(y, t'_i|_R) \times \frac{|P|}{|R|} + d(t_{i_1}|_Q, t'_i|_Q)$.

(ii) Using the method provided in the proof of Lemma 6, solve the
optimization problem defined by Formula (8) approximately. Let
$x$ be the approximate solution within error $\epsilon|P|$.

(iii) Let $s'$ be the string such that $s'|_P = x$ and $s'|_Q = t_{i_1}|_Q$. Let
$c = \max_{i=1}^{n} \min_{\{t_i \text{ is a substring of } s_i\}} d(s', t_i)$.

2. for every length-$L$ substring $s'$ of $s_1$ do

Let $c = \max_{i=1}^{n} \min_{\{t_i \text{ is a substring of } s_i\}} d(s', t_i)$.

3. Output the $s'$ with minimum $c$ in step 1(c)(iii) and step 2.

Figure 3: The PTAS for the closest substring problem.

Theorem 9 Algorithm closestSubstring is a PTAS for the closest substring problem.

Proof. Let $s$ be an optimal center string and $t_i$ be the length-$L$ substring of $s_i$ that is
the closest to $s$. Let $d_{opt} = \max_i d(s, t_i)$. Let $\epsilon$ be any small positive number and $r \ge 2$ be
any fixed integer. Let $\rho_0 = \max_{1 \le i,j \le n} d(t_i, t_j)/d_{opt}$. If $\rho_0 \le 1 + \frac{1}{2r-1}$, then clearly we can
find a solution $s'$ within ratio $\rho_0$ in step 2. So, we assume that $\rho_0 > 1 + \frac{1}{2r-1}$ from now on.

By Lemma 1, Algorithm closestSubstring picks a group of $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ in step 1 at
some point such that

Fact 1 For any $1 \le l \le n$, $|\{j \in Q \mid t_{i_1}[j] \ne t_l[j] \text{ and } t_{i_1}[j] \ne s[j]\}| \le \frac{1}{2r-1}\, d_{opt}$.

Obviously, the algorithm takes $y$ as $s|_R$ at some point in step 1(c). Let $y = s|_R$ and
$t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ satisfy Fact 1. Let $t'_i$ be defined as in step 1(c)(i). Let $s^*$ be a string such
that $s^*|_P = s|_P$ and $s^*|_Q = t_{i_1}|_Q$. Then we claim:

Fact 2 With high probability, $d(s^*, t'_i) \le d(s^*, t_i) + 2\epsilon|P|$ for all $1 \le i \le n$.

Proof. For convenience, for any position multiset $T$, we denote $d^T(t_1, t_2) = d(t_1|_T, t_2|_T)$
for any two strings $t_1$ and $t_2$. Let $\rho = \frac{|P|}{|R|}$. Consider any length $L$ substring $t'$ of $s_i$ satisfying

$$d(s^*, t') \ge d(s^*, t_i) + 2\epsilon|P|. \qquad (9)$$

It is easy to see that $\rho\, d^R(s^*, t') + d^Q(t_{i_1}, t') \le \rho\, d^R(s^*, t_i) + d^Q(t_{i_1}, t_i)$ implies either
$\rho\, d^R(s^*, t') + d^Q(s^*, t') \le d(s^*, t') - \epsilon|P|$ or $\rho\, d^R(s^*, t_i) + d^Q(s^*, t_i) \ge d(s^*, t_i) + \epsilon|P|$. Thus,
we have the following inequality:

$$\begin{aligned}
&\Pr\left(\rho\, d^R(s^*, t') + d^Q(t_{i_1}, t') \le \rho\, d^R(s^*, t_i) + d^Q(t_{i_1}, t_i)\right) \\
&\quad\le \Pr\left(\rho\, d^R(s^*, t') + d^Q(s^*, t') \le d(s^*, t') - \epsilon|P|\right) \\
&\qquad + \Pr\left(\rho\, d^R(s^*, t_i) + d^Q(s^*, t_i) \ge d(s^*, t_i) + \epsilon|P|\right). \qquad (10)
\end{aligned}$$

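It is easy to see that $d^R(s^*, t')$ is the sum of $|R|$ independent random 0-1 variables
$\sum_{i=1}^{|R|} X_i$, where $X_i = 1$ indicates a mismatch between $s^*$ and $t'$ at the $i$-th position in $R$.
Let $\mu = E[d^R(s^*, t')]$. Obviously, $\mu = d^P(s^*, t')/\rho$. Therefore, by Lemma 5 (2),

$$\begin{aligned}
&\Pr\left(\rho\, d^R(s^*, t') + d^Q(s^*, t') \le d(s^*, t') - \epsilon|P|\right) \\
&\quad= \Pr\left(d^R(s^*, t') \le (d(s^*, t') - d^Q(s^*, t'))/\rho - \epsilon|R|\right) \\
&\quad= \Pr\left(d^R(s^*, t') \le d^P(s^*, t')/\rho - \epsilon|R|\right) \\
&\quad= \Pr\left(d^R(s^*, t') \le \mu - \epsilon|R|\right) \le \exp\left(-\frac{1}{2}\epsilon^2|R|\right) \le (nm)^{-2}, \qquad (11)
\end{aligned}$$

where the last inequality is due to the setting $|R| = \lceil \frac{4}{\epsilon^2} \log(nm) \rceil$ in step 1(b) of the
algorithm. Similarly, using Lemma 5 (1) we have

$$\Pr\left(\rho\, d^R(s^*, t_i) + d^Q(s^*, t_i) \ge d(s^*, t_i) + \epsilon|P|\right) \le (nm)^{-\frac{4}{3}}. \qquad (12)$$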

Combining Formulas (10), (11) and (12), we know that for any $t'$ that satisfies Formula (9),

$$\Pr\left(\rho\, d^R(s^*, t') + d^Q(t_{i_1}, t') \le \rho\, d^R(s^*, t_i) + d^Q(t_{i_1}, t_i)\right) \le 2(nm)^{-\frac{4}{3}}. \qquad (13)$$

For any fixed $1 \le i \le n$, there are fewer than $m$ substrings $t'$ that satisfy Formula (9). Thus,
from Formula (13) and the definition of $t'_i$,

$$\Pr\left(d(s^*, t'_i) \ge d(s^*, t_i) + 2\epsilon|P|\right) \le 2 n^{-\frac{4}{3}} m^{-\frac{1}{3}}. \qquad (14)$$

Summing over all $i \in [1, n]$, we know that with probability at least $1 - 2(nm)^{-\frac{1}{3}}$, $d(s^*, t'_i) \le
d(s^*, t_i) + 2\epsilon|P|$ for all $i$. □

From Fact 1, $d(s^*, t_i) = d^P(s, t_i) + d^Q(t_{i_1}, t_i) \le d(s, t_i) + \frac{1}{2r-1}\, d_{opt}$. Combining with
Fact 2 and $|P| \le r\, d_{opt}$, we get

$$d(s^*, t'_i) \le \left(1 + \frac{1}{2r-1} + 2\epsilon r\right) d_{opt}. \qquad (15)$$

By the definition of $s^*$, the optimization problem defined by Formula (8) has a solution $s|_P$
such that $d \le (1 + \frac{1}{2r-1} + 2\epsilon r)\, d_{opt}$. We can solve the optimization problem within error
$\epsilon|P|$ by the method in the proof of Lemma 6. Let $x$ be the solution of the optimization
problem. Then by Formula (8), for any $1 \le i \le n$,

$$d(t'_i|_P, x) \le \left(1 + \frac{1}{2r-1} + 2\epsilon r\right) d_{opt} - d(t'_i|_Q, t_{i_1}|_Q) + \epsilon|P|. \qquad (16)$$

Let $s'$ be defined as in step 1(c)(iii); then by Formula (16),

$$\begin{aligned}
d(s', t'_i) &= d(x, t'_i|_P) + d(t_{i_1}|_Q, t'_i|_Q) \\
&\le \left(1 + \frac{1}{2r-1} + 2\epsilon r\right) d_{opt} + \epsilon|P| \\
&\le \left(1 + \frac{1}{2r-1} + 3\epsilon r\right) d_{opt}.
\end{aligned}$$

It is easy to see that the algorithm runs in polynomial time for any fixed positive $r$
and $\epsilon$. For any $\delta > 0$, by properly setting $r$ and $\epsilon$ such that $\frac{1}{2r-1} + 3\epsilon r \le \delta$, with high
probability, the algorithm outputs in polynomial time a solution $s'$ such that $d(t'_i, s')$ is no
more than $(1 + \delta)\, d_{opt}$ for every $1 \le i \le n$, where $t'_i$ is a substring of $s_i$. The algorithm can
be derandomized by standard methods [15]. □
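One concrete way to set the parameters in the last step is to split $\delta$ evenly between the two error terms. The split and the function name `choose_parameters` are illustrative choices, not prescribed by the proof; exact rational arithmetic is used so that the inequality can be checked without rounding error.

```python
from fractions import Fraction
from math import ceil

def choose_parameters(delta: Fraction) -> tuple[int, Fraction]:
    """Pick r >= 2 and eps with 1/(2r-1) + 3*eps*r <= delta.
    Assumes 0 < delta <= 1, so that eps also stays within (0, 1]."""
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1]")
    # Make 1/(2r-1) <= delta/2, i.e. r >= (2/delta + 1)/2.
    r = max(2, ceil((2 / delta + 1) / 2))
    # Spend the remaining delta/2 on the sampling and rounding error: 3*eps*r = delta/2.
    eps = delta / (6 * r)
    return r, eps
```

For example, $\delta = \frac{1}{2}$ gives $r = 3$ and $\epsilon = \frac{1}{36}$, for which $\frac{1}{2r-1} + 3\epsilon r = \frac{1}{5} + \frac{1}{4} \le \frac{1}{2}$.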